Usage of GeRnika

Aitor Sánchez, Maitena Tellaetxe and Borja Calvo

2023-04-23

Introduction

This is a demo for using the GeRnika R package. This document contains examples to help any user to understand the usage of the functionalities offered by the package, which include the simulation of tumor clonal data and the visualization and comparison of tumor phylogenies.

Simulating tumor clonal data

The simulation of tumor clonal data consists of simulating a matrix \(F\) that contains the mutation frequency values of a series of mutations in a collection of tumor biopsies or samples. The matrix \(F\) is calculated as the product of a matrix \(B\) that represents the phylogeny of the tumor, and a matrix \(U\), which contains the clone proportions in each particular sample of that tumor.

Tumor data can be simulated through the create_instance function. The information about its parameters and their usage may be checked in the following table:

Parameter Description Type
n Number of mutations/clones. Discrete No.
m Number of samples. Discrete No.
k Topology parameter that controls for the linearity of the topology. Continuous No.
selection Evolution model followed by the tumor. “positive”,“neutral”
noise Add sequencing noise to the values in the \(F\) matrix. Boolean (TRUE by default)
depth Average number of reads that map to the same locus (in noisy cases), also known as sequencing depth. Discrete No. (30 by default)

The following is an example of the generation of a noise-free instance of a tumor that is composed of 5 clones/mutations that has evolved under neutral evolution and has a k value of 0.5. 5 samples have been taken from it:

I <- create_instance(n = 5, m = 4, k = 0.5, selection = "neutral")

This method returns the previously mentioned \(F\), \(B\) and \(U\) matrices and an additional \(F\_true\) matrix, which we will describe later.

Once we have shown an example of the instantiation of a tumor, we will analyze the effect of varying the values of the parameters.

Topology parameter k

k is the parameter that controls for the linearity of the topology of the tumor. As a result, increasing values of k lead to rather branched phylogenies, while lower values of k produce trees that tend to linearity.

# Simulate a tumor with k=0:
I1 <- create_instance(n = 5, m = 4, k = 0, selection = "neutral")

# Simulate a tumor with k=2:
I2 <- create_instance(n = 5, m = 4, k = 8, selection = "neutral")

# Create a `Phylotree` class object for each tumor:
tree1 <- B_to_phylotree(B = I1$B)
tree2 <- B_to_phylotree(B = I2$B)

# Plot both trees
plot(tree1)
plot(tree2)

On the left the plot of tree1 (k=0), on the right the plot of tree2 (k=8).

Following the above, the tree on the left is fully branched as it is composed by a root connected to all the leaves of the tree. On the right side we can see a more linear tree, with just two main branches.

After analyzing the effect of parameter k in the generation of tumor data, we will proceed to analyze the effect of the tumor evolution model.

Tumor evolution model

With this parameter we control for the evolution model the tumor follows, either positive selection or neutral evolution. A positive selection-driven evolution model assumes that a few mutations provide cell growth advantage, whereas the remaining mutations do not. Conversely, neutral evolution models assume that, in general, none of the mutations provide significant fitness advantage. As a consequence, tumors under positive selection are dominated by a few clones whereas the rest of clones are present in small proportions. Instead, in tumors with a neutral evolution, all the clones are present in similar proportions.

The clone proportions are well observed in the \(U\) matrix, as this indeed contains the fraction of each clone in each sample of the tumor. Below, we can see this effect with a few examples:

# Simulate a tumor with positive selection:
Ipos <- create_instance(n = 5, m = 8, k = 0.5, selection = "positive")

# Simulate a tumor with neutral evolution:
Ineu <- create_instance(n = 5, m = 8, k = 0.5, selection="neutral")

# Plot the heatmaps of the U matrices of the
U_to_heatmap(Ipos$U)
U_to_heatmap(Ineu$U)
Heatmaps of the $U$ matrices of an instance of a tumor under positive selection (top) and neutral evolution (bottom).

Heatmaps of the \(U\) matrices of an instance of a tumor under positive selection (top) and neutral evolution (bottom).

In that figure, we may see that in the tumor instance that evolves under neutral evolution, the majority of clones are present in all the samples, with only a few exceptions. Most of the clones have fraction values between 0.05 and 0.4. Conversely, the biggest fraction of the tumor instance under positive selection is taken by a few clones, in particular, by clone 3, and by clones 1 and 2 in a lower proportion. In addition, more clones than in the neutral evolution instance are absent; for instance, clone 5 is even missing in all the samples.

Once we have analyzed the difference between the neutral evolution and positive selection-driven evolution models, we will show how can noisy instances be generated and compare them to noise-free instances.

Sequencing noise

GeRnika has a functionality to add sequencing noise to the simulated instances. In practice, this is done by adding a few parameters to the create_instance function. Sequencing noise is added on top of the noise-free \(F\) matrix, and the \(U\) and \(B\) matrices suffer no changes. The F slot in the resulting object contains the noisy \(F\) matrix, while the F\_true slot consists of the original noise-free \(F\) matrix. Note that these two slots contain equal matrices when no sequencing noise is added.

Down below we compare an instance we have added sequencing noise to, with its noise-free counterpart.

# Simulate a tumor with sequencing noise added:
Inoisy <- create_instance(n = 5, m = 8, k = 0.5, selection = "neutral", noisy = TRUE, depth = 5)

# Show the heatmaps of the difference between the F and F_true matrices
F_to_heatmap(abs(Inoisy$F - Inoisy$F_true))
The effect of noise.

The effect of noise.

The heatmap from above shows the difference between the \(F\) matrix and the \(F\_true\) matrix of a tumor instance, i.e. the noise added to the original VAF values of our samples. The amount of noise added is controlled by the depth parameter, which replicates the effect that the sequencing depth has on the noise level. This is explained in more detail in the following subsection.

Sequencing read depth

The sequencing read depth is the average number of reads that map to the same locus (section of the genome). Therefore, the higher the sequencing depth, the most accurate the VAF values and thus, the lower the noise will be.

See the evolution of the produced noise-error for simulations with different depth values below. The first animation below presents the progression of the error for two tumors composed by 10 clones and 2 samples, considering positive selection and neutral evolution. Contrarily, the second animation shows the progression of the error for two tumors (again with positive selection and neutral evolution) with 100 clones and 10 samples.

The evolution of the error in VAF values based on the read depth sequencing values for a tumor composed by 10 clones and 2 samples.

The evolution of the error in VAF values based on the read depth sequencing values for tumors composed by 100 clones and 10 samples.

Note that the error decreases as depth values get higher, whether they follow a neutral evolution or a positive selection. In addition, the error of bigger tumors converges to values near 0 faster than the error of smaller ones. However, the first animation shows that, even if the smaller tumor needs higher values of depth to converge, it ends up producing minor errors.

After analyzing the different parameters for simulating tumor clonal data, we will present the basis of the Phylotree S4 class.

The Phylotree S4 class

In this section, we will be using the Phylotree class for the purpose of visualizing phylogenetic trees on the basis of simulated tumor clonal data. The Phylotree S4 class is a structure that provides facilities for constructing tumor phylogenetic trees. We will now show the composition of Phylotree class objects and the different ways for instantiating them.

Structure of Phylotree class objects

As every S4 R class, the Phylotree class is composed of various attributes that are essential for building a tumor phylogenetic tree in an eficient way.

The attributes of Phylotree class may be visualized in the following table:

Parameter Description Type
B Binary square matrix containing the mutations of each clone in the tumor Matrix
clones Clones in the \(B\) matrix of the tumor Vector
genes Genes represented in the \(B\) matrix of the tumor Vector
parents Parents of the clones in the phylogenetic tree Vector
tree Structure that represents the phylogenetic tree Node
labels Tags of the genes in the phylogenetic tree Vector

Beware that the users of this package will not need to manipulate all the parameters of the Phylotree class as some of these exist only to reduce the computational cost of the visualization of phylogenetic trees and their comparison.

Once we have explained the structure of Phylotree class objects, we will proceed to show the different ways for instantiating them.

Instantiation of Phylotree class objects

The mutations of each clone in a tumor sample are represented in an n x n binary genotype \(B\) matrix. Each \(b_{i}\) row of the \(B\) matrix represents the mutations in clone \(v_{i}\). Note that we just need a \(B\) matrix of a tumor simulation in order to instantiate a new Phylotreeclass object.

Now, we will show an example of the instantiation of a Phylotree class object:

# Simulate a tumor with 5 clones, 4 samples, k=0.5:
instance <- create_instance(n = 5, m = 4, k = 0.5, selection = "positive", noisy = FALSE)

# The creation of the Phylotree class object using the previously generated B matrix:
phylotree <- B_to_phylotree(B = instance$B)

The B_to_phylotree method takes the \(B\) matrix of a tumor simulation as its main argument and calculates the values for the other attributes present in Phylotree class objects. After instantiating the new Phylotree object, we can visualize it using the generic plot method for this class:

plot(phylotree)

Phylogenetic tree composed by 5 nodes.

Instead of clone numbers, the user can use any other labelling for the nodes in the tree. The example below illustrates how this can be done:

# Create a vector with the tags we want to use (the length of the 
# vector must be equal to the number of clones in the phylogenetic tree):
tags <- c("mut1", "mut2", "mut3", "mut4", "mut5")

# Create the Phylotree class object and include the node labels:
phylotree <- B_to_phylotree(B = instance$B, labels = tags)

After creating the Phylotree class object, we may render it with the custom labels in the following way:


plot(phylotree, labels = TRUE)

Phylogenetic tree of 5 clones with custom tags. The tags in the vector are assigned to the nodes in the tree according to their order. For instance, the first clone in the previous plot is now represented with the first label of the tag vector mut1.

This is one of the possible methods for instantiating Phylotree class objects on the basis of the \(B\) matrix of a tumor. However, this package also grants the option of using the general constructor for the Phylotree S4 class for instantiating new Phylotree objects, which allows users to give specific values to the attributes of a new Phylotree class object.

Resizing nodes to clone proportions

When plotting a Phylotree class object, its nodes can be resized according to some given proportions of the clones that compose the tumor samples. To do this, an additional \(U\) matrix determining the proportions of the clones can be passed to the create_instance function.

# Simulate a tumor
instance <- create_instance(n = 5,m = 4, k = 0.5, selection = "neutral", noisy = FALSE)

# Use the B matrix to instantiate a Phylotree class object
tree <- B_to_phylotree(B = instance$B)

# Plot the phylogenetic tree and resize the nodes according to the U matrix
plot_proportions(tree, instance$U)

Phylogenetic tree of 5 clones using proportions.

Again, in this case we can also use custom labels for the nodes:

tags <- c("A", "B", "C", "D", "E")

# Simulate a tumor
instance <- create_instance(n = 5, m = 4, k = 0.5, selection = "neutral", noisy = FALSE)

# Use the B matrix of the previously generated tumor for creating a Phylotree class object
tree<-B_to_phylotree(B = instance$B, labels = tags)

# Plot the phylogenetic tree, resize the nodes according to the U matrix and include the node labels
plot_proportions(tree, instance$U, labels = TRUE)

Phylogenetic tree of 5 clones using proportions and tags.

Note that the proportions parameter can be a matrix or a vector. If a matrix is given to the function, this method will generate one plot per row in the matrix. Conversely, if a vector is given, a single plot will be generated.

Once we have shown the usage of the methods for instantiating Phylotree class objects and the procedures these can be visualized by, we will present the functions for comparing different tumor phylogenies.

Comparing and combining phylogenetic trees

GeRnika presents different functionalities for comparing tumor phylogenies. In order to show them, we will use the B\_mats object included in theGeRnika package. This contains 10 \(B\) matrix trios. Each trio corresponds to a different instance, and the \(B\) matrices arise in the course of solving the Clonal Deconvolution and Evolution Problem (CDEP) for that instance, as follows:

ices based on the solution of various instances of the Clonal Deconvolution and Evolution Problem given by the ILS and GRASP methods. This trios consist of the following matrices:

The example below loads and plots the 3 \(B\) matrices of an instance in B_mats:

# Load the predefined B matrices of the package:
B_mats <- GeRnika::B_mats

# Get the 3 B matrices from B_mats for comparing them:
B_real <- B_mats[[2]]$B_real
B_grasp <- B_mats[[2]]$B_grasp
B_opt <- B_mats[[2]]$B_opt

# Create the vector of labels for the clones that compose the phylogenetic trees:
tags <- c("mut1", "mut2", "mut3", "mut4", "mut5", "mut6", 
            "mut7", "mut8", "mut9", "mut10")

# Create the Phylotree class objects using the previously loaded B matrices:
phylotree_real <- B_to_phylotree(B = B_real, labels = tags)
phylotree_grasp <- B_to_phylotree(B = B_grasp, labels = tags)
phylotree_opt <- B_to_phylotree(B = B_opt, labels = tags)


# Render all trees:
plot(phylotree_real)
plot(phylotree_grasp)
plot(phylotree_opt)
Phylogenetic trees according to a trio of \(B\) matrices in B_mats.

As these trees above relate to the same tumor instance, it is reasonable that they may present some commonalities. Now, we will show the three different methods offered by the GeRnika package for comparing tumor phylogenies.

The equals method

The three phylogenetic trees above are not equal. For example, phylotree_real and phylotree_opt are not equal as some of the edges of phylotree_real do not exist in phylotree_opt and the other way around.

The equivalence between two phylogenetic trees may be checked by using the equals method as follows:

# Checking if phylotree_real is equal to itself:
equals(phylotree_1 = phylotree_real, phylotree_2 = phylotree_real)
#> [1] TRUE

# Checking if phylotree_real and phylotree_opt are equal:
equals(phylotree_1 = phylotree_real, phylotree_2 = phylotree_opt)
#> [1] FALSE

Equal phylogenetic trees, by definition, are composed by the same nodes, connected by the same edges. As a result, this method returns TRUE whenphylotree_real is compared with itself. Likewise, as phylotree_real and phylotree_opt are not equal, this method returns FALSE when we check whether they are equal or not.

Nevertheless, even though two trees may not be equal, they may have common subtrees.

The find_common_subtrees method

The find_common_subtrees function calculates and plots all the maximal common subtrees between two phylogenetic trees. Moreover, it also outputs the number of common and independent (the ones not shared by both trees) edges of the trees together. Finally, the method outputs the distance between the trees, which is computed as the sum of their independent edges.

find_common_subtrees(phylotree_1 = phylotree_real, phylotree_2 = phylotree_grasp)
#> Independent edges of tree1:  6 
#> Independent edges of tree2:  6 
#> Common edges:  3 
#> Distance:  12
The maximal common subtrees between phylotree_real and phylotree_grasp.
find_common_subtrees(phylotree_1 = phylotree_real, phylotree_2 = phylotree_opt)
#> Independent edges of tree1:  3 
#> Independent edges of tree2:  3 
#> Common edges:  6 
#> Distance:  6
The maximal common subtrees between phylotree_real and phylotree_opt.

We observe that there are two maximal common subtrees between phylotree_real and phylotree_grasp. Contrariwise, phylotree_real and phylotree_opt present a single maximal common subtree that covers the biggest part of both trees.

As before, we can add custom labels to the trees that result from this function call:

find_common_subtrees(phylotree_1 = phylotree_real, phylotree_2 = phylotree_grasp, labels = TRUE)
#> Independent edges of tree1:  6 
#> Independent edges of tree2:  6 
#> Common edges:  3 
#> Distance:  12
The maximal common subtrees between phylotree_real and phylotree_grasp labelled using custom tags.
find_common_subtrees(phylotree_1 = phylotree_real, phylotree_2 = phylotree_opt, labels = TRUE)
#> Independent edges of tree1:  3 
#> Independent edges of tree2:  3 
#> Common edges:  6 
#> Distance:  6
The common subtrees between phylotree_real and phylotree_opt labelled using custom tags.

According to the plots from above, we conclude that phylotree_real shares more edges with phylotree_opt than it does with ´phylotree_grasp`.

GeRnikaoffers an additional method to compare two phylogenetic trees: a consensus tree that shows the union of the nodes and edges of both trees.

The combine_trees method

The combine_trees function creates a tree that results from the union of the nodes and edges of two trees. We name this the consensus tree. In this, the nodes and the edges that compose the common subtrees between the original trees are blue. In addition, yellow edges denote to the independent edges of the tree passed as the first parameter of the method, while orange edges represent the independent edges of the second tree. Note that the independent edges of both trees are presented with more translucent colors.

# Creating the consensus tree between phylotree_real and phylotree_grasp
consensus_real_grasp <- combine_trees(phylotree_1 = phylotree_real, phylotree_2 = phylotree_grasp)


# Creating the consensus tree between phylotree_real and phylotree_opt
consensus_real_opt <- combine_trees(phylotree_1 = phylotree_real, phylotree_2 = phylotree_opt)

The consensus tree can be plotted by using the function render_graph of the DiagrammeR R package.

# Rendering the consensus between phylotree_real and phylotree_grasp
render_graph(consensus_real_grasp)
The consensus tree between phylotree_real and phylotree_grasp.
# Rendering the consensus between phylotree_real and phylotree_opt
render_graph(consensus_real_opt)
The consensus tree between phylotree_real and phylotree_opt.

The above figures present the consensus tree between phylotree_real and phylotree_opt and the consensus tree between phylotree_real and phylotree_grasp, respectively.

Users can customize both the colors and the labels of the nodes. The latter can be done by providing a custom palette –a vector containing the hexadecimal code of various colors– of 3 elements such as c("#AAAAAAAA","#BBBBBBBB","#CCCCCCCC"). Additionally, the user can use one of the color palettes included in GeRnika: Lancet, NEJM and Simpsons (default).

# Load one of the default palettes of the package:
palette <- GeRnika::palettes$Lancet

# Create the consensus tree between phylotree_real and phylotree_opt 
# using clone tags and the custom palette:
consensus <- combine_trees(phylotree_1 = phylotree_real, phylotree_2 = phylotree_opt, 
                           labels = TRUE, palette = palette)

# Render the new consensus phylogenetic tree:
render_graph(consensus)
The consensus tree between phylotree_real and phylotree_opt using tags and a selected color palette.

Note that the parameter palette of this method may take a palette –a vector containing the hexadecimal code of various colors– composed by three colors in order to use them for building the consensus tree.

Session information

This is the information of the system this document was compiled on:

#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22621)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8   
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggpubr_0.5.0     GeRnika_0.99.0   dplyr_1.0.10     colorspace_2.0-3
#> [5] reshape2_1.4.4   ggplot2_3.4.0    purrr_0.3.5      DiagrammeR_1.0.9
#> [9] data.tree_1.0.0 
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.0   xfun_0.34          bslib_0.4.1        carData_3.0-5     
#>  [5] vctrs_0.5.1        generics_0.1.3     htmltools_0.5.4    yaml_2.3.6        
#>  [9] utf8_1.2.2         rlang_1.0.6        jquerylib_0.1.4    pillar_1.8.1      
#> [13] glue_1.6.2         withr_2.5.0        DBI_1.1.3          RColorBrewer_1.1-3
#> [17] lifecycle_1.0.3    plyr_1.8.8         stringr_1.4.1      munsell_0.5.0     
#> [21] ggsignif_0.6.4     gtable_0.3.1       visNetwork_2.1.2   htmlwidgets_1.6.1 
#> [25] evaluate_0.18      labeling_0.4.2     knitr_1.41         fastmap_1.1.0     
#> [29] fansi_1.0.3        highr_0.9          broom_1.0.1        Rcpp_1.0.9        
#> [33] scales_1.2.1       backports_1.4.1    cachem_1.0.6       jsonlite_1.8.3    
#> [37] abind_1.4-5        farver_2.1.1       digest_0.6.30      stringi_1.7.8     
#> [41] rstatix_0.7.1      cowplot_1.1.1      grid_4.2.1         cli_3.4.1         
#> [45] tools_4.2.1        magrittr_2.0.3     sass_0.4.3         tibble_3.1.8      
#> [49] car_3.1-1          tidyr_1.2.1        pkgconfig_2.0.3    ellipsis_0.3.2    
#> [53] assertthat_0.2.1   rmarkdown_2.18     rstudioapi_0.14    R6_2.5.1          
#> [57] compiler_4.2.1